SARC: Sentiment-Augmented Deep Role Clustering for Fake News Detection

Wang, Jingqing, Shang, Jiaxing, Xu, Rong, Hao, Fei, Huang, Tianjin, Min, Geyong

arXiv.org Artificial Intelligence

Fake news detection has been a long-standing research focus in social networks. Recent studies suggest that incorporating sentiment information from both news content and user comments can enhance detection performance. However, existing approaches typically treat sentiment features as auxiliary signals, overlooking role differentiation, that is, the same sentiment polarity may originate from users with distinct roles, thereby limiting their ability to capture nuanced patterns for effective detection. To address this issue, we propose SARC, a Sentiment-Augmented Role Clustering framework that utilizes sentiment-enhanced deep clustering to identify user roles for improved fake news detection. The framework first generates user features through joint comment text representation (with a BiGRU and an attention mechanism) and sentiment encoding. It then constructs a differentiable deep clustering module to automatically categorize user roles. Finally, unlike existing approaches, which take the fake-news label as the sole supervision signal, we propose a joint optimization objective integrating role clustering and fake news detection to further improve model performance. Experimental results on two benchmark datasets, RumourEval-19 and Weibo-comp, demonstrate that SARC achieves superior performance across all metrics compared to baseline models. The code is available at: https://github.com/jxshang/SARC.
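A joint objective of the kind the abstract describes, combining a detection loss with a differentiable clustering loss, can be sketched as follows. This is a minimal illustration assuming a DEC-style Student-t soft assignment and a weighting factor `lam`; all names are illustrative and this is not the authors' implementation:

```python
import math

def soft_assignments(x, centroids):
    """Student-t soft assignment of one user feature vector to role centroids (DEC-style)."""
    q = [1.0 / (1.0 + sum((xi - ci) ** 2 for xi, ci in zip(x, c))) for c in centroids]
    total = sum(q)
    return [qi / total for qi in q]

def clustering_loss(q, p):
    """KL(p || q) between a sharpened target distribution p and the soft assignment q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def detection_loss(prob_fake, label):
    """Binary cross-entropy for the fake-news detection head."""
    eps = 1e-12
    return -(label * math.log(prob_fake + eps) + (1 - label) * math.log(1 - prob_fake + eps))

def joint_loss(prob_fake, label, q, p, lam=0.5):
    """Joint objective: detection loss plus lam-weighted role-clustering loss."""
    return detection_loss(prob_fake, label) + lam * clustering_loss(q, p)
```

Because both terms are differentiable, the same gradient step can update the encoder, the cluster centroids, and the detection head together, which is what makes role discovery trainable end to end.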


A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings

Gaim, Fitsum, Song, Hoyun, Lee, Huije, Ko, Changgeon, Hwang, Eui Jun, Park, Jong C.

arXiv.org Artificial Intelligence

Content moderation research has recently made significant advances, but remains limited in serving the majority of the world's languages due to the lack of resources, leaving millions of vulnerable users exposed to online hostility. This work presents a large-scale human-annotated multi-task benchmark dataset for abusive language detection in Tigrinya social media with joint annotations for three tasks: abusiveness, sentiment, and topic classification. The dataset comprises 13,717 YouTube comments annotated by nine native speakers, collected from 7,373 videos with a total of over 1.2 billion views across 51 channels. We developed an iterative term clustering approach for effective data selection. Recognizing that around 64% of Tigrinya social media content uses Romanized transliterations rather than native Ge'ez script, our dataset accommodates both writing systems to reflect actual language use. We establish strong baselines across the tasks in the benchmark, while leaving significant challenges for future contributions. Our experiments demonstrate that small fine-tuned models outperform prompted frontier large language models (LLMs) in the low-resource setting, achieving 86.67% F1 in abusiveness detection (7+ points over the best LLM), and maintain stronger performance in all other tasks. The benchmark is made public to promote research on online safety.


From Keywords to Clusters: AI-Driven Analysis of YouTube Comments to Reveal Election Issue Salience in 2024

Simoes, Raisa M., Kelly, Timoteo, Simoes, Eduardo J., Rao, Praveen

arXiv.org Artificial Intelligence

This paper explores two competing data science methodologies to attempt answering the question, "Which issues contributed most to voters' choice in the 2024 presidential election?" The methodologies involve novel empirical evidence driven by artificial intelligence (AI) techniques. Using two distinct methods based on natural language processing and clustering analysis to mine over eight thousand user comments on election-related YouTube videos from one right-leaning journal, the Wall Street Journal, and one left-leaning journal, the New York Times, during pre-election week, we quantify the frequency of selected issue areas among user comments to infer which issues were most salient to potential voters in the seven days preceding the November 5th election. Empirically, we primarily demonstrate that immigration and democracy were the most frequently and consistently invoked issues in user comments on the analyzed YouTube videos, followed by the issue of identity politics, while inflation was referenced significantly less often. These results corroborate certain findings of post-election surveys but also refute the supposed importance of inflation as an election issue. This indicates that variations on opinion mining, with their analysis of raw user data online, can be more revealing than polling and surveys for analyzing election outcomes.

Keywords: artificial intelligence; opinion mining; clustering; vote choice; cleavages

1. Introduction

The Democrats lost both houses of Congress and the Presidency to Republicans in the 2024 election, with former president Donald Trump winning all seven swing states and the national popular vote, despite most pre-election polls giving Vice President Kamala Harris and President Trump a roughly equal chance of winning.
Most post-election punditry and analysis in the legacy press and alternative media has attributed the Democrats' large loss to two main issues: inflation [59] and immigration [30]. However, a growing contingent of analysts has also attributed the election outcome to the Democratic party's association with cultural issues purportedly distant from the median voter's preferences, such as those alternatively aggregated under the concept of "identity" or "woke" politics [54, 56]. To this point, three post-election studies illustrate how voters associated Democrats with left-of-center ideas that were ostensibly distant from most voters' priorities. Survey research from the think tank Third Way demonstrates that Democrats, and thus Kamala Harris, were largely perceived as "too liberal" [15], while a study from More In Common polling over 5,000 Americans concluded that while inflation was the top concern for every major demographic group across both parties, Americans misperceived LGBT/transgender policies as the top policy priority for Democrats [37].
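The frequency-counting side of the analysis described above can be sketched as a keyword-lexicon pass over comments. The issue lexicons below are illustrative placeholders, not the paper's actual keyword sets:

```python
from collections import Counter

# Illustrative issue lexicons; the paper's actual keyword sets are not specified here.
ISSUE_KEYWORDS = {
    "immigration": {"immigration", "border", "migrants"},
    "democracy": {"democracy", "democratic", "authoritarian"},
    "inflation": {"inflation", "prices", "groceries"},
}

def issue_salience(comments):
    """Count how many comments mention at least one keyword for each issue area."""
    counts = Counter()
    for comment in comments:
        tokens = set(comment.lower().split())
        for issue, keywords in ISSUE_KEYWORDS.items():
            if tokens & keywords:
                counts[issue] += 1
    return counts
```

Ranking issues by these per-comment counts gives a simple salience ordering; the clustering-based method in the paper would instead group comments first and label the resulting clusters.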


Digital Voices of Survival: From Social Media Disclosures to Support Provisions for Domestic Violence Victims

Wang, Kanlun, Fu, Zhe, Xin, Wangjiaxuan, Zhou, Lina, Chandrappa, Shashi Kiran

arXiv.org Artificial Intelligence

Domestic Violence (DV) is a pervasive public health problem characterized by patterns of coercive and abusive behavior within intimate relationships. With the rise of social media as a key outlet for DV victims to disclose their experiences, online self-disclosure has emerged as a critical yet underexplored avenue for support-seeking. In addition, existing research lacks a comprehensive and nuanced understanding of DV self-disclosure, support provisions, and their connections. To address these gaps, this study proposes a novel computational framework for modeling DV support-seeking behavior alongside community support mechanisms. The framework consists of four key components: self-disclosure detection, post clustering, topic summarization, and support extraction and mapping. We implement and evaluate the framework with data collected from relevant social media communities. Our findings not only advance existing knowledge on DV self-disclosure and online support provisions but also enable victim-centered digital interventions.



Bridging Subjective and Objective QoE: Operator-Level Aggregation Using LLM-Based Comment Analysis and Network MOS Comparison

Panahi, Parsa Hassani Shariat, Jalilvand, Amir Hossein, Najafi, M. Hassan

arXiv.org Artificial Intelligence

This paper introduces a dual-layer framework for network operator-side quality of experience (QoE) assessment that integrates both objective network modeling and subjective user perception extracted from live-streaming platforms. On the objective side, we develop a machine learning model trained on mean opinion scores (MOS) computed via the ITU-T P.1203 reference implementation, allowing accurate prediction of user-perceived video quality using only network parameters such as packet loss, delay, jitter, and throughput, without reliance on video content or client-side instrumentation. On the subjective side, we present a semantic filtering and scoring pipeline that processes user comments from live streams to extract performance-related feedback. A large language model is used to assign scalar MOS scores to filtered comments in a deterministic and reproducible manner. To support scalable and interpretable analysis, we construct a labeled dataset of 47,894 live-stream comments, of which about 34,000 are identified as QoE-relevant through multi-layer semantic filtering. Each comment is enriched with simulated Internet service provider (ISP) attribution and temporally aligned using synthetic timestamps in 5-minute intervals. The resulting dataset enables operator-level aggregation and time-series analysis of user-perceived quality. A delta MOS metric is proposed to measure each ISP's deviation from platform-wide sentiment, allowing detection of localized degradations even in the absence of direct network telemetry. A controlled outage simulation confirms the framework's effectiveness in identifying service disruptions through comment-based trends alone. The system provides each operator with its own subjective MOS and the global platform average per interval, enabling real-time interpretation of performance deviations and comparison with objective network-based QoE estimates.
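A delta-MOS aggregation of the kind described above can be sketched as a per-interval difference between each ISP's mean comment score and the platform-wide mean. This is a minimal stdlib illustration of the idea, not the paper's implementation:

```python
from collections import defaultdict

def delta_mos(scores):
    """scores: iterable of (isp, interval, mos) triples.
    Returns {(isp, interval): isp_mean - platform_mean} for each ISP and time interval."""
    by_interval = defaultdict(list)
    by_isp_interval = defaultdict(list)
    for isp, interval, mos in scores:
        by_interval[interval].append(mos)
        by_isp_interval[(isp, interval)].append(mos)
    deltas = {}
    for (isp, interval), vals in by_isp_interval.items():
        platform_mean = sum(by_interval[interval]) / len(by_interval[interval])
        deltas[(isp, interval)] = sum(vals) / len(vals) - platform_mean
    return deltas
```

A strongly negative delta for one ISP in an interval where the platform average is stable is exactly the localized-degradation signal the abstract describes.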


Retrieval-Confused Generation is a Good Defender for Privacy Violation Attack of Large Language Models

Peng, Wanli, Chen, Xin, Fu, Hang, He, XinYu, Yiming, Xue, Wen, Juan

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) have made a profound impact on our society and also raised new security concerns. In particular, due to the remarkable inference ability of LLMs, the privacy violation attack (PVA), revealed by Staab et al., introduces serious personal privacy issues. Existing defense methods mainly leverage LLMs to anonymize the input query, which requires costly inference time and cannot achieve satisfactory defense performance. Moreover, directly rejecting the PVA query may seem effective, but doing so exposes the defense itself, promoting the evolution of PVA. In this paper, we propose a novel defense paradigm based on retrieval-confused generation (RCG) of LLMs, which can efficiently and covertly defend against the PVA. We first design a paraphrasing prompt to induce the LLM to rewrite the "user comments" of the attack query to construct a disturbed database. Then, we propose a most-irrelevant retrieval strategy to retrieve the desired user data from the disturbed database. Finally, the "data comments" are replaced with the retrieved user data to form a defended query, so the adversary receives incorrect personal attributes, i.e., the attack fails. Extensive experiments are conducted on two datasets and eight popular LLMs to comprehensively evaluate the feasibility and the superiority of the proposed defense method.
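The most-irrelevant retrieval step can be illustrated as an argmin over a similarity measure: instead of returning the closest database entry, the defense returns the least similar one. The bag-of-words cosine similarity below is a stand-in for whatever retrieval scoring the paper actually uses:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_irrelevant(query, database):
    """Return the database entry LEAST similar to the query (argmin, not argmax)."""
    q = Counter(query.lower().split())
    return min(database, key=lambda doc: cosine(q, Counter(doc.lower().split())))
```

Substituting this least-similar entry into the defended query is what steers the LLM toward wrong personal attributes without any visible refusal.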


COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation

Kumar, Raghvendra, Salman, S. A. Mohammed, Sahu, Aryan, Nandi, Tridib, P., Pragathi Y., Saha, Sriparna, Moreno, Jose G.

arXiv.org Artificial Intelligence

Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset's effectiveness, we employ state-of-the-art language models such as LLama3 and GPT-4. We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments, filtering out noise using a dedicated comment classifier using IndicBERT, and extracting valuable insights from images with a multilingual CLIP-based classifier. This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.


Reddit sues AI company Anthropic for allegedly 'scraping' user comments to train chatbot

The Guardian

The social media platform Reddit has sued the artificial intelligence company Anthropic, alleging that it is illegally "scraping" the comments of Reddit users to train its chatbot Claude. Reddit claims that Anthropic has used automated bots to access the social network's content despite being asked not to do so, and "intentionally trained on the personal data of Reddit users without ever requesting their consent". Anthropic did not immediately return a request for comment. The claim was filed on Wednesday in the superior court of California in San Francisco. "AI companies should not be allowed to scrape information and content from people without clear limitations on how they can use that data," said Ben Lee, Reddit's chief legal officer, in a statement on Wednesday.